
    Do unbalanced data have a negative effect on LDA?

    For two-class discrimination, Xie and Qiu [The effect of imbalanced data sets on LDA: a theoretical and empirical analysis, Pattern Recognition 40 (2) (2007) 557–562] claimed that, when the covariance matrices of the two classes are unequal, a (class) unbalanced data set has a negative effect on the performance of linear discriminant analysis (LDA). Through re-balancing 10 real-world data sets, Xie and Qiu provided empirical evidence to support this claim, using AUC (area under the receiver operating characteristic curve) as the performance metric. We argue that the claim is vague, if not misleading, that no solid theoretical analysis is presented by Xie and Qiu, and that AUC can lead to a conclusion about the discrimination performance of LDA on unbalanced data sets quite different from the one reached with the misclassification error rate (ER). Our empirical and simulation studies suggest that, for LDA, the increase in the median of AUC (and thus the improvement in the performance of LDA) from re-balancing is relatively small, whereas the increase in the median of ER (and thus the decline in the performance of LDA) from re-balancing is relatively large. Therefore, our study provides no reliable empirical evidence to support the claim that a (class) unbalanced data set has a negative effect on the performance of LDA. In addition, re-balancing affects the performance of LDA for data sets with either equal or unequal covariance matrices, indicating that unequal covariance matrices are not a key reason for the difference in performance between the original and re-balanced data sets.
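
    As a rough illustration of the comparison described above, the following sketch (assuming scikit-learn and NumPy; the synthetic two-class data and the simple undersampling scheme are placeholders, not the data or re-balancing procedure of the original study) fits LDA on an unbalanced training set and on a re-balanced one, and reports AUC and ER on a common test set.

```python
# Minimal sketch: LDA performance before and after re-balancing by undersampling.
# Synthetic data only; not the 10 real-world data sets referred to above.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import roc_auc_score, accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Unbalanced two-class data with unequal covariance matrices.
n_maj, n_min = 900, 100
X_maj = rng.multivariate_normal([0, 0], [[1.0, 0.3], [0.3, 1.0]], size=n_maj)
X_min = rng.multivariate_normal([1.5, 1.5], [[2.0, -0.4], [-0.4, 0.5]], size=n_min)
X = np.vstack([X_maj, X_min])
y = np.concatenate([np.zeros(n_maj), np.ones(n_min)])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

def fit_and_score(X_train, y_train):
    """Fit LDA and report (AUC, ER) on the common held-out test set."""
    lda = LinearDiscriminantAnalysis().fit(X_train, y_train)
    auc = roc_auc_score(y_te, lda.decision_function(X_te))
    er = 1.0 - accuracy_score(y_te, lda.predict(X_te))
    return auc, er

# Original (unbalanced) training set.
auc_orig, er_orig = fit_and_score(X_tr, y_tr)

# Re-balanced training set: undersample the majority class to the minority size.
idx_maj = np.where(y_tr == 0)[0]
idx_min = np.where(y_tr == 1)[0]
idx_bal = np.concatenate([rng.choice(idx_maj, size=len(idx_min), replace=False), idx_min])
auc_bal, er_bal = fit_and_score(X_tr[idx_bal], y_tr[idx_bal])

print(f"original:    AUC={auc_orig:.3f}  ER={er_orig:.3f}")
print(f"re-balanced: AUC={auc_bal:.3f}  ER={er_bal:.3f}")
```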

    Short note on two output-dependent hidden Markov models

    The purpose of this note is to study the assumption of "mutual information independence", which is used by Zhou (2005) to derive an output-dependent hidden Markov model, the so-called discriminative HMM (D-HMM), in the context of determining a stochastic optimal sequence of hidden states. The assumption is extended to derive its generative counterpart, the G-HMM. In addition, state-dependent representations for two output-dependent HMMs, namely HMMSDO (Li, 2005) and the D-HMM, are presented.

    Learning Mixtures of Gaussians in High Dimensions

    Efficiently learning mixtures of Gaussians is a fundamental problem in statistics and learning theory. Given samples drawn from a random one of k Gaussian distributions in R^n, the learning problem asks to estimate the means and the covariance matrices of these Gaussians. This problem arises in many areas, ranging from the natural sciences to the social sciences, and has also found many machine learning applications. Unfortunately, learning mixtures of Gaussians is an information-theoretically hard problem: in order to learn the parameters up to a reasonable accuracy, the number of samples required is, in the worst case, exponential in the number of Gaussian components. In this work, we show that, provided we are in high enough dimensions, the class of Gaussian mixtures is learnable in its most general form under a smoothed analysis framework, where the parameters are randomly perturbed from an adversarial starting point. In particular, given samples from a mixture of Gaussians with randomly perturbed parameters, when n > Ω(k^2), we give an algorithm that learns the parameters in polynomial running time using a polynomial number of samples. The central algorithmic ideas are new ways to decompose the moment tensor of the Gaussian mixture by exploiting its structural properties. The symmetries of this tensor derive from the combinatorial structure of higher-order moments of Gaussian distributions (sometimes referred to as Isserlis' theorem or Wick's theorem). We also develop new tools for bounding the smallest singular values of structured random matrices, which could be useful in other smoothed analysis settings.
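
    The abstract appeals to Isserlis' (Wick's) theorem for the structure of higher-order Gaussian moments. The short check below (NumPy only; the covariance matrix and sample size are arbitrary choices) compares an empirical fourth-moment tensor with the Wick formula; it does not implement the paper's moment-tensor decomposition algorithm.

```python
# Numerical check of Isserlis' (Wick's) theorem for a zero-mean Gaussian:
# E[x_i x_j x_k x_l] = S_ij S_kl + S_ik S_jl + S_il S_jk, with S the covariance.
import numpy as np

rng = np.random.default_rng(1)
n = 3
A = rng.standard_normal((n, n))
S = A @ A.T                                  # a random covariance matrix
X = rng.multivariate_normal(np.zeros(n), S, size=200_000)

# Empirical fourth-moment tensor M[i, j, k, l] ~ E[x_i x_j x_k x_l].
M_emp = np.einsum('ti,tj,tk,tl->ijkl', X, X, X, X) / X.shape[0]

# Wick / Isserlis prediction from the covariance alone.
M_wick = (np.einsum('ij,kl->ijkl', S, S)
          + np.einsum('ik,jl->ijkl', S, S)
          + np.einsum('il,jk->ijkl', S, S))

# Deviation shrinks as the sample size grows.
print("max abs deviation:", np.max(np.abs(M_emp - M_wick)))
```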

    Microstructure Effects on Daily Return Volatility in Financial Markets

    We simulate a series of daily returns from intraday price movements initiated by microstructure elements. Significant evidence is found that daily returns and daily return volatility exhibit first-order autocorrelation; trading volume and daily return volatility are not correlated, although trading volume and intraday volatility are. We also consider GARCH effects in daily return series and show that estimates using daily returns are biased by the influence of the level of prices. Using daily price changes instead, we find evidence of a significant GARCH component. These results suggest that microstructure elements have a considerable influence on the return-generating process.
    Comment: 15 pages, as presented at the Complexity Workshop in Aix-en-Provence
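
    A minimal sketch of the kind of diagnostics mentioned above, assuming NumPy and the third-party `arch` package; the synthetic return series is only a placeholder for the simulated microstructure returns, and the GARCH(1,1) specification is the standard one, not necessarily the model used in the paper.

```python
# Lag-1 autocorrelation of returns and squared returns, plus a GARCH(1,1) fit.
import numpy as np
from arch import arch_model

rng = np.random.default_rng(0)
r = rng.standard_normal(2_000) * 0.01      # placeholder daily returns

def lag1_autocorr(x):
    """Sample lag-1 autocorrelation."""
    x = x - x.mean()
    return np.dot(x[:-1], x[1:]) / np.dot(x, x)

print("lag-1 autocorr of returns:        ", lag1_autocorr(r))
print("lag-1 autocorr of squared returns:", lag1_autocorr(r ** 2))

# GARCH(1,1) on percent-scale returns (the `arch` package is more stable at this scale).
res = arch_model(100 * r, vol="Garch", p=1, q=1, mean="Constant").fit(disp="off")
print(res.params)   # omega, alpha[1], beta[1] quantify the GARCH component
```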

    D-optimal designs via a cocktail algorithm

    A fast new algorithm is proposed for the numerical computation of (approximate) D-optimal designs. This "cocktail algorithm" extends the well-known vertex direction method (VDM; Fedorov, 1972) and the multiplicative algorithm (Silvey, Titterington and Torsney, 1978), and shares their simplicity and monotonic convergence properties. Numerical examples show that the cocktail algorithm can lead to dramatically improved speed, sometimes by orders of magnitude, relative to either the multiplicative algorithm or the vertex exchange method (a variant of VDM). Key to the improved speed is a new nearest neighbor exchange strategy, which acts locally and complements the global effect of the multiplicative algorithm. Possible extensions to related problems such as nonparametric maximum likelihood estimation are mentioned.
    Comment: A number of changes after accounting for the referees' comments, including new examples in Section 4 and more detailed explanations throughout
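
    For orientation, the following is a sketch of the classical multiplicative algorithm (Silvey, Titterington and Torsney, 1978) that the cocktail algorithm builds on; the vertex-direction and nearest-neighbor exchange steps are omitted, and the candidate set is a hypothetical quadratic-regression grid, not an example from the paper.

```python
# Multiplicative algorithm for approximate D-optimal design weights.
import numpy as np

def multiplicative_d_optimal(X, n_iter=500, tol=1e-9):
    """X: (m, p) candidate regressor vectors; returns design weights w of length m."""
    m, p = X.shape
    w = np.full(m, 1.0 / m)                      # start from the uniform design
    for _ in range(n_iter):
        M = X.T @ (w[:, None] * X)               # information matrix M(w)
        d = np.einsum('ij,jk,ik->i', X, np.linalg.inv(M), X)   # variance function d(x_i, w)
        w_new = w * d / p                        # multiplicative update; weights stay on the simplex
        if np.max(np.abs(w_new - w)) < tol:
            return w_new
        w = w_new
    return w

# Toy example: quadratic regression on a grid over [-1, 1]; the D-optimal design
# concentrates its mass near -1, 0 and 1 with weights close to 1/3 each.
t = np.linspace(-1, 1, 41)
X = np.column_stack([np.ones_like(t), t, t ** 2])
w = multiplicative_d_optimal(X)
print(t[w > 0.05], w[w > 0.05].round(3))
```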

    A Bayesian reassessment of nearest-neighbour classification

    The k-nearest-neighbour procedure is a well-known deterministic method used in supervised classification. This paper proposes a reassessment of this approach as a statistical technique derived from a proper probabilistic model; in particular, we modify the assessment made in a previous analysis of this method undertaken by Holmes and Adams (2002, 2003), and evaluated by Manocha and Girolami (2007), where the underlying probabilistic model is not completely well-defined. Once a clear probabilistic basis for the k-nearest-neighbour procedure is established, we derive computational tools for conducting Bayesian inference on the parameters of the corresponding model. In particular, we assess the difficulties inherent in pseudo-likelihood and in path sampling approximations of an intractable normalising constant, and propose a perfect sampling strategy to implement a correct MCMC sampler associated with our model. If perfect sampling is not available, we suggest using a Gibbs sampling approximation. Illustrations of the performance of the corresponding Bayesian classifier are provided for several benchmark datasets, demonstrating in particular the limitations of the pseudo-likelihood approximation in this set-up.
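
    For reference, the deterministic k-nearest-neighbour rule that the paper recasts probabilistically can be run as a baseline along these lines (assuming scikit-learn; the iris data set stands in for the benchmark data sets, and none of the Bayesian machinery is sketched here).

```python
# Deterministic k-NN baseline with cross-validated accuracy for a few values of k.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
for k in (1, 5, 15):
    acc = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    print(f"k={k:2d}  5-fold CV accuracy={acc:.3f}")
```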

    Research informed sustainable development through art and design pedagogic practices

    This paper explores a pedagogic case study, which embeds academic research activity into a master's-level unit of study. Students were invited to work alongside the LiFE ‘Living in Future Ecologies’ research group at Manchester School of Art to collaboratively investigate themes for sustainable development within a city context. Pomona Island, a brownfield site on the borders of Manchester, Salford and Trafford, presented a context for complex issues of local government and questions of international relevance on resilience and responsible urban planning. Through learning about the landscape and sensitive ecology of the island, students and researchers explored notions of context, climate, visions for future living, and the opportunities and responsibility of art and design practices in steering social reasoning within a neoliberal system. This paper presents a carefully considered enquiry-based framework, analysing the academic questioning that has enabled the transformation of the ephemeral and immaterial into a methodology to address misguided political agendas. The paper articulates the different methods used to embed research practice in the learning environment. The project also illustrates innovative learning and teaching methods as ways in which art and design practices can uniquely engage with and stimulate thinking to influence and nurture change. Through presenting responses from a psychogeographical walk for Manchester European City of Science in July 2016, a conversational, transformative tool for learning was developed. Reflections on the project further evaluate the multi-disciplinary interpretations, already collated in a collaborative publication with the Pomona community and the publisher Gaia Project.

    An approximate Bayesian marginal likelihood approach for estimating finite mixtures

    Estimation of finite mixture models when the mixing distribution support is unknown is an important problem. This paper gives a new approach based on a marginal likelihood for the unknown support. Motivated by a Bayesian Dirichlet prior model, a computationally efficient stochastic approximation version of the marginal likelihood is proposed, and large-sample theory is presented. By restricting the support to a finite grid, a simulated annealing method is employed to maximize the marginal likelihood and estimate the support. Real and simulated data examples show that this novel stochastic approximation--simulated annealing procedure compares favorably to existing methods.
    Comment: 16 pages, 1 figure, 3 tables
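
    A generic sketch in the spirit of the grid-restricted search described above: simulated annealing over subsets of a fixed grid of support points, with a BIC-style penalized profile likelihood standing in for the paper's stochastic-approximation marginal likelihood (the data, grid, and tuning choices are all placeholders).

```python
# Simulated annealing over candidate supports drawn from a finite grid.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(-2, 1, 150), rng.normal(2, 1, 100)])   # toy data
grid = np.linspace(-4, 4, 17)                                         # candidate support points

def penalized_loglik(support, n_em=50):
    """Profile log-likelihood (means fixed at `support`, unit variances, weights by EM)
    minus a BIC-style penalty; a crude stand-in for the marginal likelihood."""
    dens = norm.pdf(y[:, None], loc=support[None, :], scale=1.0)      # (n, k)
    w = np.full(len(support), 1.0 / len(support))
    for _ in range(n_em):
        resp = dens * w
        resp /= resp.sum(axis=1, keepdims=True)
        w = resp.mean(axis=0)
    return np.log(dens @ w).sum() - 0.5 * len(support) * np.log(len(y))

def anneal(n_steps=2000, t0=5.0):
    state = rng.random(len(grid)) < 0.5            # boolean mask = current support
    if not state.any():
        state[0] = True
    best, best_val = state.copy(), penalized_loglik(grid[state])
    cur_val = best_val
    for s in range(n_steps):
        temp = t0 * (1 - s / n_steps) + 1e-3       # linear cooling schedule
        prop = state.copy()
        j = rng.integers(len(grid))
        prop[j] = ~prop[j]                         # flip one grid point in or out
        if not prop.any():
            continue
        val = penalized_loglik(grid[prop])
        if val > cur_val or rng.random() < np.exp((val - cur_val) / temp):
            state, cur_val = prop, val
            if val > best_val:
                best, best_val = prop.copy(), val
    return grid[best], best_val

support, obj = anneal()
print("estimated support:", support, " penalized log-likelihood:", round(obj, 1))
```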

    Quantitative assessment of sewer overflow performance with climate change in northwest England

    Changes in rainfall patterns associated with climate change, in particular a potential increase in rainfall amount, can affect the operation of a combined sewer system. This could lead to excessive spill frequencies and could introduce hazardous substances into the receiving waters, which, in turn, would have an impact on the quality of shellfish and bathing waters. This paper quantifies the spill volume, duration and frequency of 19 combined sewer overflows (CSOs) to receiving waters under two climate change scenarios, the high-emissions (A1FI) and the low-emissions (B1) scenarios, simulated by three global climate models (GCMs), for a study catchment in northwest England. Future rainfall is downscaled using climatic variables from the HadCM3, CSIRO and CGCM2 GCMs with a hybrid generalized linear–artificial neural network model. The model simulations for 2080 showed, at most, an annual increase of 37% in total spill volume, 32% in total spill duration, and 12% in spill frequency against the shellfish water limiting requirements; these maxima were obtained under the high-emissions scenario as projected by HadCM3. Nevertheless, the catchment drainage system is projected to cope with the 2080 conditions under all three GCMs. The results also indicate that, under scenario B1, CSIRO projects a significant drop, reaching up to 50% in spill volume, 39% in spill duration and 25% in spill frequency. The results further show that, during the bathing season, a substantial drop in the CSO spill drivers is expected under both scenarios, as predicted by all GCMs.